Application of interpretable machine learning algorithms to predict distant metastasis in ovarian clear cell carcinoma

Abstract
Background: Ovarian clear cell carcinoma (OCCC) is a subtype of epithelial ovarian cancer (EOC) known for its limited responsiveness to chemotherapy, and the onset of distant metastasis significantly worsens patient prognosis. This study aimed to identify potential risk factors contributing to the occurrence of distant metastasis in OCCC.
Methods: Using the Surveillance, Epidemiology, and End Results (SEER) database, we identified patients diagnosed with OCCC between 2004 and 2015. The most influential factors were selected with the Gaussian Naive Bayes (GNB) and Adaboost machine learning algorithms, and a Venn diagram was used to retain the factors shared by both rankings. Subsequently, six machine learning (ML) techniques, namely XGBoost, LightGBM, Random Forest (RF), Adaptive Boosting (Adaboost), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), were employed to construct predictive models for distant metastasis. SHapley Additive exPlanations (SHAP) analysis provided a visual interpretation for individual patients. Model validity was assessed using accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and the area under the receiver operating characteristic curve (AUC).
Results: The RF model outperformed the other five machine learning algorithms in predicting distant metastasis, with accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and AUC (95% CI) values of 0.792 (0.762–0.823), 0.904 (0.835–0.973), 0.759 (0.731–0.787), 0.221 (0.186–0.256), 0.974 (0.967–0.982), 0.353 (0.306–0.399), and 0.834 (0.696–0.967), respectively. Additionally, the Brier score (95% CI) of the RF model's calibration curve reached the minimum value among the models, 0.06256 (0.05753–0.06759). SHAP analysis provided independent explanations, reaffirming the critical clinical factors associated with the risk of metastasis in OCCC patients.
Conclusions: This study successfully established a precise predictive model for OCCC patient metastasis using machine learning techniques, offering valuable support to clinicians in making informed clinical decisions.


| INTRODUCTION
Epithelial ovarian cancer (EOC), often referred to as ovarian cancer, comprises a heterogeneous group of diseases characterized by distinct genomic features.1-3 Based on their unique pathological characteristics, epithelial ovarian cancers can be classified into several subtypes, including ovarian serous carcinoma (OSC), ovarian mucinous carcinoma (OMC), ovarian endometrioid carcinoma (OEC), and ovarian clear cell carcinoma (OCCC).3,4 OCCC is distinguished by the presence of clear cytoplasm.5,6 OCCC accounts for approximately 5%-20% of all ovarian cancer cases, with notable variations in incidence across racial and geographical groups. Its prevalence is approximately 3.1% in the black population, 4.8% in the white population, and 11.1% among individuals of Asian descent.7,8 Notably, the Asian region, particularly Japan, exhibits the highest incidence of OCCC, with rates reaching up to 25%.9 While OCCC is relatively straightforward to diagnose at an early stage, its inherent insensitivity to platinum-based chemotherapy regimens contributes to a heightened risk of infiltration, metastasis, and relapse after treatment, resulting in a markedly unfavorable prognosis.10 Treatment modalities for OCCC recurrence and metastasis include secondary cytoreductive surgery (SCS) and nonplatinum single-agent chemotherapy, sometimes in combination with bevacizumab.11,12 Clinical data have indicated no substantial difference in the 5-year overall survival rate between 144 patients with recurrent metastatic OCCC who did not undergo SCS and 25 patients who did.13 Another study reported a notably low response rate of just 33% among 25 patients with platinum-resistant OCCC who received secondary nonplatinum chemotherapy.14
Hence, the identification of risk factors for OCCC metastasis and recurrence, along with the development of a metastasis prediction model, is of paramount importance in improving the survival prospects of OCCC patients. This research aims to extract clinical characteristics and treatment data of OCCC patients diagnosed between 2004 and 2015 from the SEER database. The study endeavors to analyze the factors influencing distant metastasis in OCCC patients and assess the prognostic indicators for patients with distant metastasis. Additionally, the study will construct nomograms to predict the risk and prognosis of distant metastasis in OCCC cases, providing valuable insights for guiding the clinical management and individual survival prognosis of patients with distant metastasis of OCCC.

| Database
The patient data used in this study were sourced from the SEER database (http://seer.cancer.gov/), one of the authoritative large-scale tumor registration databases in the United States. We retrieved the clinical case data for OCCC patients aged 25 years or older diagnosed from 2004 to 2015 from the SEER database using SEER*Stat software (v 8.4.2) and exported the data list to an Excel table for data screening (Figure 1).

| Study population
We collected data spanning the years 2004-2015 on 66,986 individuals with a confirmed diagnosis of ovarian cancer, as identified by the International Classification of Diseases Version 10 (ICD-10) code in the SEER database. To establish a well-defined and homogeneous cohort for subsequent analysis, we applied specific inclusion and exclusion criteria. The inclusion criteria required that individuals be at least 25 years of age, have ovarian clear cell carcinoma (OCCC) as their primary tumor, have a complete record of essential personal and disease-related information, have a clearly documented metastatic status, and have their diagnosis confirmed through pathological examination with the ICD-O-3 HIST/BEHAV code "8310". Conversely, the exclusion criteria encompassed patients below 25 years of age, those lacking accurate metastatic information, individuals with a primary tumor other than OCCC, those without complete information records, and patients whose diagnosis relied on autopsy or a death certificate. The data extraction flowchart is shown in Figure 1.
Of the 66,986 ovarian cancer patients, 3220 had OCCC as their primary tumor. Among these, 1449 cases had metastasis information available. Our analysis incorporated demographic variables such as age, marital status, and race, in addition to clinicopathological and treatment-related parameters, including tumor laterality, tumor size, preoperative serum CA125 level, tumor differentiation, TNM stage, residual lesion size, surgical procedures, radiotherapy, and chemotherapy. For tumor differentiation, grades I, II, III, and IV represent well differentiated, moderately differentiated, poorly differentiated, and undifferentiated tumors, respectively; we combined grades I and II into one group and grades III and IV into another.

| Weighting and selection of variables for analysis
Variable selection, that is, ranking variables by their importance (weights), plays a crucial role in many prediction problems. Traditional methods of importance ranking fall into two main categories: those based on model coefficients and those based on model performance (e.g., importance analysis using univariate numerical permutation).15 This study adopts the first category, which relies on model coefficients, to analyze variable importance. It employs machine learning techniques to rank the variables after the initial screening, thereby further extracting significant features. Two machine learning methods, namely GNB and Adaboost, are used for feature importance ranking.

| Description of multiple machine learning algorithms
XGBoost is a supervised learning algorithm that combines weak regression trees to enhance predictive performance while controlling model complexity, using a Taylor expansion of the error function up to its first and second derivatives. Applied in medicine, XGBoost addresses overfitting and reduces computational load.16
LightGBM, a Microsoft-provided algorithm, is an iterative boosting tree system that improves upon traditional gradient boosting decision trees by using both first- and second-order negative gradients. It incorporates a histogram-based decision tree algorithm for faster execution and adopts a leaf-wise growth strategy for increased efficiency.17
Random Forest (RF) is an ensemble learning algorithm that constructs a classification model from multiple decision trees. It employs a majority-vote mechanism, providing high accuracy, resistance to interference, and ease of implementation, but comes with a higher computational burden.18
Adaboost is an iterative algorithm that boosts weak classifiers by updating the weights of training samples, emphasizing misclassified samples in each iteration. It creates multiple weak classifiers, which are combined into a strong classifier through weighted summation.19
Support Vector Machine (SVM) is a classification algorithm that minimizes structural risk for improved generalization; it acts as a binary classifier and defines a maximum-margin linear classifier in the feature space. Training an SVM reduces to solving a convex quadratic programming problem, which gives robust results even with limited sample sizes.20
Multilayer Perceptron (MLP) addresses linearly inseparable problems by stacking multiple layers of linear classifiers with nonlinear activation functions. It includes input, hidden, and output layers characterized by weights, biases, and activation functions, commonly employing the Sigmoid function for nonlinear mappings within the specified output range.21
Fisher's exact test or the chi-squared test was employed for comparing qualitative data. The top 10 most critical risk factors for distant metastasis of ovarian cancer were identified by GNB and by Adaboost, and the eight factors shared by both rankings were selected using Venn diagrams. Additionally, six machine learning classification algorithms, namely XGBoost, LGBM, RF, AdaBoost, SVM, and MLP, were employed to establish classifier models for distant metastasis.
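The second-order Taylor expansion mentioned in the XGBoost description can be written out as follows. This is the standard textbook form of the boosting objective rather than a formula taken from this study; the symbols $l$, $f_t$, $g_i$, $h_i$, $\Omega$, $\gamma$, $\lambda$, $T$, and $w_j$ are notation assumed here for illustration.

```latex
% Objective at boosting round t: loss after adding tree f_t, plus a complexity penalty
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)

% Second-order Taylor approximation around the previous prediction \hat{y}_i^{(t-1)}
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
  + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t)

% First and second derivatives of the loss with respect to the previous prediction
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}

% Complexity penalty for a tree with T leaves and leaf weights w_j
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2
```

The penalty term $\Omega$ is what the description refers to as controlling model complexity: larger $\gamma$ or $\lambda$ discourages trees with many leaves or large leaf weights.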

| The basic characteristics of metastatic and nonmetastatic data
Among a total of 1449 cases, 96 exhibited distant metastasis, while 1353 cases showed no signs of distant metastasis. Table 2 presents a comprehensive demographic and clinical analysis based on the criterion of distant metastasis.

| Correlation analysis
Spearman correlation analysis was employed to investigate the interrelationships among the variables. The correlation heatmap (Figure 2) illustrates the pairwise correlations among the factors. For example, the correlation coefficient between N and T was 0.36, which is below 0.5 and indicates a weak correlation; the remaining pairs were also weakly correlated.
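As an illustration of the Spearman analysis described above, the following is a minimal standard-library sketch that ranks each variable (averaging ranks for ties) and then computes the Pearson correlation of the ranks. The function names and the toy values are assumptions made for this example and are not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ordinal codes for two staging variables (illustrative only).
toy_t = [1, 1, 2, 3, 3, 2]
toy_n = [0, 0, 1, 1, 0, 1]
print(round(spearman(toy_t, toy_n), 3))
```

Applying `spearman` to every pair of variables yields the matrix that a heatmap such as Figure 2 visualizes.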

| Feature variable selection
Feature variables were selected using two machine learning algorithms, GNB (Figure 3A) and Adaboost (Figure 3B). Each algorithm was used to identify the top 10 most important feature variables for its model. Subsequently, the intersection of the two rankings, obtained with a Venn diagram, identified eight variables (Grade, CA125, Surgery, T, Residual Tumor Volume, N, Laterality, and Tumor size) for the construction of the model (Figure 3C).
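The intersection step described above (top features from each algorithm, then the overlap shown in the Venn diagram) can be sketched as below. The importance scores and feature names `f1` to `f4` are placeholders invented for the example, and ranking by absolute score is an assumption; the study reports coefficient values but does not specify the ranking rule.

```python
def top_k(scores, k):
    """Names of the k features with the largest absolute importance score."""
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k]


def shared_top_features(scores_a, scores_b, k=10):
    """Features that appear in the top k of both rankings (the Venn overlap)."""
    top_a = top_k(scores_a, k)
    in_top_b = set(top_k(scores_b, k))
    # Keep the order of the first ranking so the output is deterministic.
    return [name for name in top_a if name in in_top_b]


# Placeholder scores for two hypothetical rankers (not values from the study).
scores_ranker_a = {"f1": 0.9, "f2": 0.7, "f3": 0.4, "f4": 0.1}
scores_ranker_b = {"f1": 0.3, "f2": 0.05, "f3": 0.5, "f4": 0.2}
print(shared_top_features(scores_ranker_a, scores_ranker_b, k=2))
```

With `k=10` and the two real rankings, the same function would return the variables in the overlapping region of the Venn diagram.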

| The predictive performance and calibration of machine learning models
To establish a predictive model for distant metastasis in ovarian clear cell carcinoma (OCCC) based on machine learning algorithms, we used the eight features (Grade, CA125, Surgery, T, Residual Tumor Volume, N, Laterality, and Tumor size) identified through screening as independent factors. The algorithms employed were XGBoost, LGBM, RF, AdaBoost, SVM, and MLP. To mitigate overfitting and select the optimal model, 10-fold cross-validation was performed on the training set, yielding average values for accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and AUC for the six machine learning models (Table 3). In the decision curve analysis (DCA), the RF model showed a favorable net clinical benefit, confirming its excellent performance in the test set (Figure 6B). Using SHAP summary plots (Figure 7A), we computed the contribution of each feature to the model output to identify the most relevant predictive factors. The SHAP importance plot (Figure 7B) further elucidates the impact of individual features on the model.
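To make the idea behind the SHAP contributions above concrete, here is a brute-force sketch of exact Shapley values for a single prediction. It is not the algorithm used by the SHAP library for tree models, and it is not the study's code: the function `shapley_values`, the baseline-replacement convention for absent features, and the toy model `toy_risk` are all assumptions for illustration. The cost grows exponentially with the number of features, which is tolerable for a handful of inputs.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_risk(features):
    """Hypothetical risk score with one additive term and one interaction."""
    return 0.1 + 0.3 * features[0] + 0.2 * features[1] * features[2]


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_risk, x, baseline)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, which is the additivity property that SHAP summary and importance plots rely on.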

| DISCUSSION
EOC is a complex and heterogeneous group of diseases characterized by diverse genomic features.22 OCCC, a distinct subtype of EOC, poses significant challenges in terms of chemotherapy resistance and poor prognosis.23 This research focuses on leveraging machine learning algorithms to analyze clinical data from OCCC patients and construct predictive models for distant metastasis. The study aims to identify risk factors and prognostic indicators, providing valuable insights for enhancing the clinical management and survival prognosis of OCCC patients with distant metastasis.
The prevalence of OCCC varies across racial and geographical groups, with the Asian population, particularly in Japan, exhibiting the highest incidence rates. OCCC is not easily detected in the early stages. Considering the high incidence of lymph node metastasis in this subtype, early OCCC requires extensive staging, including pelvic and para-aortic lymph node dissection.
OCCC's resistance to platinum-based chemotherapy contributes to increased risks of infiltration, metastasis, and relapse, leading to unfavorable prognoses.24 This study's emphasis on identifying risk factors for metastasis is crucial, as it addresses the pressing need to improve the survival prospects of OCCC patients.
To identify the most critical risk factors, machine learning algorithms, including GNB and Adaboost, were employed.25 A Venn diagram was used for further refinement. The eight most important risk factors obtained were Grade, CA125, Surgery, T, Residual Tumor Volume, N, Laterality, and Tumor size, which suggests that clinical attention should be paid to improving and recording these indicators in patients with OCCC for the assessment of metastasis risk. We further evaluated the predictive performance and calibration of the machine learning models, including XGBoost, LGBM, RF, AdaBoost, SVM, and MLP. The Random Forest model demonstrated superior performance, surpassing the other models in terms of accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and AUC. The thorough evaluation of the model's performance using metrics and visualizations, such as ROC curves, calibration curves, and decision curve analysis, adds credibility to the study's findings.
Previous studies have suggested that most EOCs are characterized by peritoneal disseminated metastasis. For advanced EOCs, neoadjuvant chemotherapy (NACT) can be considered to reduce tumor volume, reduce surgical difficulty, and improve surgical success rates. Recently, new clinical treatment methods have been developed, including immune checkpoint blockade therapy, targeted anti-angiogenic therapy, exploiting synthetic lethal interactions with ARID1A, and targeting hepatocyte nuclear factor 1β. New therapies such as ferroptosis-based approaches bring great hope to ovarian cancer patients.26 Compared with imaging-based recognition of distant metastases and staging diagnosis, the model provides a new approach. It takes a statistical perspective, using an RF model and the relevant variables to estimate the probability of distant metastasis.
The application of the Random Forest model as the final classification model for the test set yielded promising results, emphasizing its potential utility in predicting distant metastasis in OCCC patients. The accuracy, sensitivity, specificity, and other metrics demonstrated the model's ability to provide valuable insights into patient prognosis.
This research contributes to the field of ovarian cancer research by leveraging machine learning techniques to identify and understand the factors influencing distant metastasis in OCCC patients. The predictive models developed in this study have the potential to assist clinicians in making informed decisions regarding the clinical management of OCCC patients, ultimately improving survival outcomes. Future studies could focus on validating these models using external datasets and exploring additional factors, such as postoperative complications, that may contribute to the metastatic behavior of OCCC. Meanwhile, there are perhaps more pressing clinical questions to which this approach could be applied, such as preoperative risk stratification for malignancy (as in the O-RADS system), determining the likelihood of a complete gross resection, predicting postoperative complication rates, and predicting platinum resistance.

| CONCLUSION
In conclusion, the study successfully leveraged machine learning algorithms, particularly the Random Forest model, to develop a predictive model for distant metastasis in OCCC. The robust performance of the model suggests its potential clinical utility in guiding treatment decisions and improving outcomes for OCCC patients. Further validation and refinement of the model could contribute to its integration into clinical practice for personalized care.

FIGURE 2 A heatmap representation of the Spearman correlation matrix of the variables. Relevant correlations are color-coded based on the strength of the correlation.
FIGURE 3 The top 10 important variables selected by the GNB (A) and Adaboost (B) machine learning algorithms. The results are expressed as coefficient values. (C) Venn analysis of the results of the above two machine learning algorithms.
The results indicate that, for the validation set, the RF model demonstrated superior predictive performance, with accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and AUC (95% CI) of 0.792 (0.762-0.823), 0.904 (0.835-0.973), 0.759 (0.731-0.787), 0.221 (0.186-0.256), 0.974 (0.967-0.982), 0.353 (0.306-0.399), and 0.834 (0.696-0.967), respectively, surpassing the other machine learning models. The performance of each model in the training and validation sets is shown in Table 3, and the ROC curves for the six models in the training set (Figure 4A) and validation set (Figure 4B) are illustrated. The comparison of multiple machine learning evaluation indicators in the validation set is shown in Figure 5. Subsequently, the calibration curve of the RF model was analyzed; it aligned with the diagonal line, indicating excellent performance in the test set (Figure 6A), with a Brier Score of 0.038. The decision curve analysis (DCA) curve for the RF model also exhibited a favorable net clinical benefit (Figure 6B).
FIGURE 4 ROC curve comparison of the training set (A) and validation set (B) across multiple machine learning algorithms.
FIGURE 5 Comparison of multiple machine learning evaluation indicators in the validation set.
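The evaluation quantities reported above (the confusion-matrix metrics, the Brier score of the calibration curve, and the net benefit plotted in a decision curve) follow standard definitions, sketched here with the standard library only. The function names, the 0.5 decision threshold, and the eight toy labels and probabilities are assumptions for illustration and are unrelated to the study data; the sketch also assumes no metric has a zero denominator.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit of treating patients with risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy labels (1 = metastasis) and predicted probabilities, illustrative only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.8, 0.3, 0.6, 0.4, 0.1, 0.2, 0.7, 0.9]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

metrics = classification_metrics(*confusion_counts(y_true, y_pred))
print(metrics)
print(brier_score(y_true, y_prob), net_benefit(y_true, y_prob, 0.5))
```

A decision curve such as Figure 6B is obtained by evaluating `net_benefit` over a range of thresholds and comparing it with the treat-all and treat-none strategies.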

FIGURE 6 (A) RF model test set calibration curve. (B) RF model test set decision curve analysis (DCA).

2.5 | Statistical analysis and model evaluation
Concerning baseline characteristics, we utilized Student's t-test or the Mann-Whitney U test to compare quantitative data, while Fisher's exact test or the chi-squared test was employed to compare qualitative data.
TABLE 1 Demographic and clinicopathologic variables within the entire cohort stratified based on training and test sets.
TABLE 2 Demographic and clinicopathologic variables within the entire cohort stratified based on metastasis status. Note: M1 represents a patient with distant metastasis.
TABLE 3 Comparison of multiple machine learning evaluation indexes between training set and test set.